Pandas

Disclaimer: this course is adapted from the Pandas tutorial by Joris Van den Bossche. R users might also want to read Pandas: Comparison with R / R libraries for a smooth start with Pandas.

%matplotlib inline
import os
import numpy as np
import calendar
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pooch  # download data / avoid re-downloading
from IPython import get_ipython


sns.set_palette("colorblind")
pd.options.display.max_rows = 8

First case study: Titanic dataset

First, it is important to download remote files automatically, for reproducibility (and to avoid typing file names manually):

url = "http://josephsalmon.eu/enseignement/datasets/titanic.csv"
path_target = "./titanic.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)  # if needed `pip install pooch`
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/titanic.csv'

Reading the file as a pandas dataframe:

df_titanic_raw = pd.read_csv("titanic.csv")

Visualize the end of the dataset:

df_titanic_raw.tail(n=3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

Visualize the beginning of the dataset:

df_titanic_raw.head(n=5)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Missing values

It is common to encounter features/covariates with missing values. In pandas, they have mostly been handled as np.nan (not a number). In the future, they will be treated as NA (Not Available), in a similar way as in R; see the Pandas documentation on missing data for standard behavior and details.

Note that the main difference between pd.NA and np.nan is that pd.NA propagates even for comparisons:

pd.NA == 1
<NA>

whereas

np.nan == 1
False
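This propagation also matters at the Series level. A minimal sketch with toy values (the names s_nan and s_na are ours): comparisons on a nullable "Int64" column keep <NA> where a value is missing, whereas a classic float column silently returns False.

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1.0, np.nan, 3.0])           # classic float column: np.nan
s_na = pd.Series([1, pd.NA, 3], dtype="Int64")  # nullable integer column: pd.NA

# np.nan comparisons return a plain False...
print((s_nan == 1.0).tolist())  # [True, False, False]
# ...whereas pd.NA propagates through the comparison:
print((s_na == 1).tolist())     # [True, <NA>, False]
```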

Testing the presence of missing values

pd.isna(pd.NA)   # True
pd.isna(np.nan)  # True
True

The simplest strategy (when you can / when you have enough samples) consists of removing all nans/NAs.

df_titanic = df_titanic_raw.dropna()
df_titanic.tail(3)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
# Useful info on the dataset (especially missing values!)
df_titanic.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 1 to 889
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  183 non-null    int64  
 1   Survived     183 non-null    int64  
 2   Pclass       183 non-null    int64  
 3   Name         183 non-null    object 
 4   Sex          183 non-null    object 
 5   Age          183 non-null    float64
 6   SibSp        183 non-null    int64  
 7   Parch        183 non-null    int64  
 8   Ticket       183 non-null    object 
 9   Fare         183 non-null    float64
 10  Cabin        183 non-null    object 
 11  Embarked     183 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 18.6+ KB
# Check that the `Cabin` information is mostly missing; the same holds for `Age`
df_titanic_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

Description of the titanic.csv dataset

Details of the dataset are given here

  • Survived: Survival 0 = No, 1 = Yes
  • Pclass: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
  • Sex: Sex male/female
  • Age: Age in years
  • SibSp: # of siblings/spouses aboard the Titanic
  • Parch: # of parents/children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
  • Name: Name of the passenger
  • PassengerId: Number to identify passenger
Note

For those interested, an extended version of the dataset is available here https://biostat.app.vumc.org/wiki/pub/Main/DataSets/titanic.txt.

Simple descriptive statistics

df_titanic.describe()
PassengerId Survived Pclass Age SibSp Parch Fare
count 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000 183.000000
mean 455.366120 0.672131 1.191257 35.674426 0.464481 0.475410 78.682469
std 247.052476 0.470725 0.515187 15.643866 0.644159 0.754617 76.347843
min 2.000000 0.000000 1.000000 0.920000 0.000000 0.000000 0.000000
25% 263.500000 0.000000 1.000000 24.000000 0.000000 0.000000 29.700000
50% 457.000000 1.000000 1.000000 36.000000 0.000000 0.000000 57.000000
75% 676.000000 1.000000 1.000000 47.500000 1.000000 1.000000 90.000000
max 890.000000 1.000000 3.000000 80.000000 3.000000 4.000000 512.329200

Visualization

  • Histograms (please avoid…often useless)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.hist(df_titanic['Age'], density=True, bins=25)
plt.xlabel('Age')
plt.ylabel('Proportion')
plt.title("Passenger age histogram")
plt.show()

  • Kernel Density Estimate (KDE):
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
sns.kdeplot(
    df_titanic["Age"], ax=ax, fill=True, cut=0, bw_adjust=0.1
)
plt.xlabel("Age")
plt.ylabel("Density")
plt.title("Passenger age kernel density estimate")
plt.tight_layout()
plt.show()

Note

The bandwidth parameter (here encoded as bw_adjust) controls the smoothing level. It is a common parameter in kernel smoothing, investigated thoroughly in the non-parametric statistics literature.
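To make this concrete, here is a minimal sketch on synthetic data (the sample and the total-variation proxy are ours, not from the course data), using scipy.stats.gaussian_kde: the smaller the bandwidth, the wigglier the resulting estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=200)   # synthetic data standing in for the ages
xx = np.linspace(-4, 4, 200)

kde_rough = stats.gaussian_kde(sample, bw_method=0.1)(xx)   # small bandwidth
kde_smooth = stats.gaussian_kde(sample, bw_method=1.0)(xx)  # large bandwidth


def total_variation(y):
    """Crude roughness proxy: sum of absolute increments of the curve."""
    return float(np.abs(np.diff(y)).sum())


# The smaller the bandwidth, the wigglier (rougher) the estimate
print(total_variation(kde_rough) > total_variation(kde_smooth))  # True
```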

  • Swarmplot:
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
sns.swarmplot(
    data=df_titanic_raw,
    ax=ax,
    x="Sex",
    y="Age",
    hue="Survived",
    palette={0: "r", 1: "k"},
    order=["female", "male"],
)
plt.title("Passenger age by gender/survival")
plt.legend(labels=["Died", "Survived"], loc="upper left")
plt.tight_layout()
plt.show()
/home/bcharlier/.conda/envs/hax712/lib/python3.9/site-packages/seaborn/categorical.py:3544: UserWarning:

6.8% of the points cannot be placed; you may want to decrease the size of the markers or use stripplot.

EXERCISE density over histogram

Plot the density estimate over the histogram

Interactivity

Interactive exploration of code and output is nowadays getting easier and easier (see also the Shiny app in the R software). In Python, one can use plotly, or widgets with the interact function from ipywidgets, for this purpose. We are going to illustrate this with the simple KDE and histogram examples.

import plotly.graph_objects as go
import numpy as np
from scipy import stats

# Create figure
fig = go.FigureWidget()

# Add traces, one for each slider step
bws = np.arange(0.01, 0.51, 0.01)
xx = np.arange(0, 100, 0.5)
for step in bws:
    kernel = stats.gaussian_kde(df_titanic['Age'], bw_method=step)
    yy = kernel(xx)
    fig.add_trace(
        go.Scatter(
            visible=False,
            fill='tozeroy',
            fillcolor='rgba(67, 67, 67, 0.5)',
            line=dict(color="black", width=2),
            name=f"Bw = {step:.2f}",
            x=xx,
            y=yy))

# Make 10th trace visible
fig.data[10].visible = True

# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)}],
        label=f"{bws[i]:.2f}",
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)
sliders = [dict(
    active=10,
    currentvalue={"prefix": "Bandwidth = "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    template="simple_white",
    title="Impact of the bandwidth on the KDE",
    showlegend=False,
)

fig.show()
import plotly.graph_objects as go
import numpy as np
from scipy import stats

# Create figure
fig = go.FigureWidget()

# Add traces, one for each slider step
bws = np.arange(0.01, 5.51, 0.3)
xx = np.arange(0, 100, 0.5)

for step in bws:
    fig.add_trace(
        go.Histogram(
            x=df_titanic['Age'],
            xbins=dict(  # bins used for the histogram
                start=xx.min(),
                end=xx.max(),
                size=step,
            ),
            autobinx=False,
            visible=False,
            marker_color="rgba(67, 67, 67, 0.5)",
            name=f"Bandwidth = {step:.2f}"
            ))

# Make 10th trace visible
fig.data[10].visible = True

# Create and add slider
steps = []
for i in range(len(fig.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig.data)}],
        label=f"{bws[i]:.2f}",
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)
sliders = [dict(
    active=10,
    currentvalue={"prefix": "Bandwidth = "},
    pad={"t": 50},
    steps=steps
)]

fig.update_layout(
    sliders=sliders,
    template="simple_white",
    title="Impact of the bandwidth on histograms",
    showlegend=False,
)

fig.show()

IPython widgets (in Jupyter notebooks)

import ipywidgets as widgets
from ipywidgets import interact, fixed

def hist_explore(
    dataset=df_titanic,
    variable=df_titanic.columns,
    n_bins=24,
    alpha=0.25,
    density=False,
):
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    ax.hist(
        dataset[variable], density=density, bins=n_bins, alpha=alpha
    )  # density=True normalizes the histogram
    plt.ylabel("Density level")
    plt.title(f"Histogram for passengers' {variable}")
    plt.tight_layout()
    plt.show()


interact(
    hist_explore,
    dataset=fixed(df_titanic),
    n_bins=(1, 50, 1),
    alpha=(0, 1, 0.1),
    density=False,
)
def kde_explore(dataset=df_titanic, variable=df_titanic.columns, bw=5):
    fig, ax = plt.subplots(1, 1, figsize=(5, 5))
    sns.kdeplot(dataset[variable], bw_adjust=bw, fill=True, cut=0, ax=ax)
    plt.ylabel("Density level")
    plt.title(f"KDE for passengers' {variable}")
    plt.tight_layout()
    plt.show()

interact(kde_explore, dataset=fixed(df_titanic), bw=(0.001, 2, 0.01))

Groupby function

How does the survival rate change w.r.t. sex?

df_titanic_raw.groupby('Sex')[['Survived']].aggregate(lambda x: x.mean())
Survived
Sex
female 0.742038
male 0.188908

How does the survival rate change w.r.t. the class?

df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
fig, ax = plt.subplots(1, 1, figsize=(5, 5))

df_titanic.groupby('Pclass')['Survived'].mean().plot(ax=ax, kind='bar')
plt.xlabel('Class')
plt.ylabel('Survival rate')
plt.title('Survival rate per class')
plt.show()

EXERCISE median by class

Perform a similar analysis with the median for the price per class in pounds.

catplot: a visual groupby

ax=sns.catplot(
    x="Pclass",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    data=df_titanic_raw,
    jitter=0.2,
    s=8,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()

ax=sns.catplot(
    x="Pclass",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    alpha=0.8,
    data=df_titanic_raw,
    kind='swarm',
    s=11,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()

ax=sns.catplot(
    x="Sex",
    y="Age",
    hue="Sex",
    palette={'female': 'red', 'male': 'b'},
    col='Pclass',
    alpha=0.8,
    data=df_titanic_raw,
    kind='swarm',
    s=6,
    height=5,
    aspect=0.35
)
plt.show()

ax=sns.catplot(x='Pclass',
            y='Age',
            hue="Sex",
            palette={'female': 'red', 'male': 'b'},
            data=df_titanic_raw,
            kind="violin",
            alpha=0.8,
)
sns.move_legend(ax, "upper left", bbox_to_anchor=(0.8, 0.8))
plt.show()

Beware: large difference in sex ratio by class

df_titanic_raw.groupby(['Sex', 'Pclass'])[['Sex']].count()
df_titanic_raw.groupby(['Sex'])[['Sex']].count()
Sex
Sex
female 314
male 577

Consider checking the raw data, as often boxplots or simple statistics are not enough to understand the diversity inside the data; see for instance the discussion here: https://sigmoid.social/@ct_bergstrom@fediscience.org/110267907203021857

pd.crosstab

pd.crosstab(
    df_titanic_raw["Sex"],
    df_titanic_raw["Pclass"],
    values=df_titanic_raw["Sex"],
    aggfunc="count",
    normalize=False,
)
Pclass 1 2 3
Sex
female 94 76 144
male 122 108 347
df_titanic
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
... ... ... ... ... ... ... ... ... ... ... ... ...
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

183 rows × 12 columns

df_titanic.index
Int64Index([  1,   3,   6,  10,  11,  21,  23,  27,  52,  54,
            ...
            835, 853, 857, 862, 867, 871, 872, 879, 887, 889],
           dtype='int64', length=183)
df_titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
pd.options.display.max_rows = 12
df_titanic.dtypes

df_titanic['Name'].astype(str)
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
6                                McCarthy, Mr. Timothy J
10                       Sandstrom, Miss. Marguerite Rut
11                              Bonnell, Miss. Elizabeth
                             ...                        
871     Beckwith, Mrs. Richard Leonard (Sallie Monypeny)
872                             Carlsson, Mr. Frans Olof
879        Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)
887                         Graham, Miss. Margaret Edith
889                                Behr, Mr. Karl Howell
Name: Name, Length: 183, dtype: object

Extract numpy arrays from dataframes

This is useful for feeding packages built on top of pandas (e.g., sklearn, though nowadays it accepts pandas dataframes out of the box).

array_titanic = df_titanic.to_numpy()  # associated numpy array (preferred over .values)
array_titanic
array([[2, 1, 1, ..., 71.2833, 'C85', 'C'],
       [4, 1, 1, ..., 53.1, 'C123', 'S'],
       [7, 0, 1, ..., 51.8625, 'E46', 'S'],
       ...,
       [880, 1, 1, ..., 83.1583, 'C50', 'C'],
       [888, 1, 1, ..., 30.0, 'B42', 'S'],
       [890, 1, 1, ..., 30.0, 'C148', 'C']], dtype=object)
EXERCISE: dropna

Perform the following operations: remove the column Cabin from the raw dataset, and then remove the rows where the variable Age is missing.

Hint: check the ‘dropna’ documentation.

1D dataset: Series (a column of a DataFrame)

A Series is a labeled one-dimensional array, i.e., a single column of a DataFrame.
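As a toy illustration (the names and values are ours, not from the titanic file), a Series can be built directly from values plus a label index:

```python
import pandas as pd

# a Series = values + an index of labels + an optional name
fares = pd.Series([7.25, 71.28, 7.92],
                  index=["Braund", "Cumings", "Heikkinen"],
                  name="Fare")
print(fares["Cumings"])  # label-based access -> 71.28
print(fares.mean())      # usual numeric methods still apply
```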

fare = df_titanic['Fare']
fare
1      71.2833
3      53.1000
6      51.8625
10     16.7000
11     26.5500
        ...   
871    52.5542
872     5.0000
879    83.1583
887    30.0000
889    30.0000
Name: Fare, Length: 183, dtype: float64

Attributes Series: indices and values

fare.values[:10]
array([ 71.2833,  53.1   ,  51.8625,  16.7   ,  26.55  ,  13.    ,
        35.5   , 263.    ,  76.7292,  61.9792])

Contrary to numpy arrays, you can index with formats other than integers:

# Be careful, what follows changes the indexing
df_titanic_raw = df_titanic_raw.set_index('Name')
df_titanic_raw
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Cumings, Mrs. John Bradley (Florence Briggs Thayer) 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
Montvila, Rev. Juozas 887 0 2 male 27.0 0 0 211536 13.0000 NaN S
Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.0000 B42 S
Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.4500 NaN S
Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.0000 C148 C
Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 11 columns

age = df_titanic_raw['Age']
age['Behr, Mr. Karl Howell']
26.0
age.mean()
29.69911764705882
df_titanic_raw[age < 2]
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
Name
Caldwell, Master. Alden Gates 79 1 2 male 0.83 0 2 248738 29.0000 NaN S
Panula, Master. Eino Viljami 165 0 3 male 1.00 4 1 3101295 39.6875 NaN S
Johnson, Miss. Eleanor Ileen 173 1 3 female 1.00 1 1 347742 11.1333 NaN S
Becker, Master. Richard F 184 1 2 male 1.00 2 1 230136 39.0000 F4 S
Allison, Master. Hudson Trevor 306 1 1 male 0.92 1 2 113781 151.5500 C22 C26 S
... ... ... ... ... ... ... ... ... ... ... ...
Hamalainen, Master. Viljo 756 1 2 male 0.67 1 1 250649 14.5000 NaN S
Dean, Master. Bertram Vere 789 1 3 male 1.00 1 2 C.A. 2315 20.5750 NaN S
Thomas, Master. Assad Alexander 804 1 3 male 0.42 0 1 2625 8.5167 NaN C
Mallet, Master. Andre 828 1 2 male 1.00 0 2 S.C./PARIS 2079 37.0042 NaN C
Richards, Master. George Sibley 832 1 2 male 0.83 1 1 29106 18.7500 NaN S

14 rows × 11 columns

# You can come back to the original indexing
df_titanic_raw = df_titanic_raw.reset_index()

Counting values for categorical variables

df_titanic_raw['Embarked'].value_counts(normalize=False, sort=True,
                                        ascending=False)
S    644
C    168
Q     77
Name: Embarked, dtype: int64
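With normalize=True, value_counts returns proportions instead of counts; a minimal sketch on toy values (the ports Series is ours):

```python
import pandas as pd

ports = pd.Series(["S", "C", "S", "Q", "S"])
print(ports.value_counts().to_dict())                # {'S': 3, 'C': 1, 'Q': 1}
print(ports.value_counts(normalize=True).to_dict())  # {'S': 0.6, 'C': 0.2, 'Q': 0.2}
```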
pd.options.display.max_rows = 70
df_titanic[df_titanic['Embarked'] == 'C']
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
52 53 1 1 Harper, Mrs. Henry Sleeper (Myna Haxtun) female 49.0 1 0 PC 17572 76.7292 D33 C
54 55 0 1 Ostby, Mr. Engelhart Cornelius male 65.0 0 1 113509 61.9792 B30 C
96 97 0 1 Goldschmidt, Mr. George B male 71.0 0 0 PC 17754 34.6542 A5 C
97 98 1 1 Greenfield, Mr. William Bertram male 23.0 0 1 PC 17759 63.3583 D10 D12 C
118 119 0 1 Baxter, Mr. Quigg Edmond male 24.0 0 1 PC 17558 247.5208 B58 B60 C
139 140 0 1 Giglio, Mr. Victor male 24.0 0 0 PC 17593 79.2000 B86 C
174 175 0 1 Smith, Mr. James Clinch male 56.0 0 0 17764 30.6958 A7 C
177 178 0 1 Isham, Miss. Ann Elizabeth female 50.0 0 0 PC 17595 28.7125 C49 C
194 195 1 1 Brown, Mrs. James Joseph (Margaret Tobin) female 44.0 0 0 PC 17610 27.7208 B4 C
195 196 1 1 Lurette, Miss. Elise female 58.0 0 0 PC 17569 146.5208 B80 C
209 210 1 1 Blank, Mr. Henry male 40.0 0 0 112277 31.0000 A31 C
215 216 1 1 Newell, Miss. Madeleine female 31.0 1 0 35273 113.2750 D36 C
218 219 1 1 Bazzani, Miss. Albina female 32.0 0 0 11813 76.2917 D15 C
273 274 0 1 Natsch, Mr. Charles H male 37.0 0 1 PC 17596 29.7000 C118 C
291 292 1 1 Bishop, Mrs. Dickinson H (Helen Walton) female 19.0 1 0 11967 91.0792 B49 C
292 293 0 2 Levy, Mr. Rene Jacques male 36.0 0 0 SC/Paris 2163 12.8750 D C
299 300 1 1 Baxter, Mrs. James (Helene DeLaudeniere Chaput) female 50.0 0 1 PC 17558 247.5208 B58 B60 C
307 308 1 1 Penasco y Castellana, Mrs. Victor de Satode (M... female 17.0 1 0 PC 17758 108.9000 C65 C
309 310 1 1 Francatelli, Miss. Laura Mabel female 30.0 0 0 PC 17485 56.9292 E36 C
310 311 1 1 Hays, Miss. Margaret Bechstein female 24.0 0 0 11767 83.1583 C54 C
311 312 1 1 Ryerson, Miss. Emily Borie female 18.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
319 320 1 1 Spedden, Mrs. Frederic Oakley (Margaretta Corn... female 40.0 1 1 16966 134.5000 E34 C
325 326 1 1 Young, Miss. Marie Grice female 36.0 0 0 PC 17760 135.6333 C32 C
329 330 1 1 Hippach, Miss. Jean Gertrude female 16.0 0 1 111361 57.9792 B18 C
337 338 1 1 Burns, Miss. Elizabeth Margaret female 41.0 0 0 16966 134.5000 E40 C
366 367 1 1 Warren, Mrs. Frank Manley (Anna Sophia Atkinson) female 60.0 1 0 110813 75.2500 D37 C
369 370 1 1 Aubart, Mme. Leontine Pauline female 24.0 0 0 PC 17477 69.3000 B35 C
370 371 1 1 Harder, Mr. George Achilles male 25.0 1 0 11765 55.4417 E50 C
377 378 0 1 Widener, Mr. Harry Elkins male 27.0 0 2 113503 211.5000 C82 C
393 394 1 1 Newell, Miss. Marjorie female 23.0 1 0 35273 113.2750 D36 C
452 453 0 1 Foreman, Mr. Benjamin Laventall male 30.0 0 0 113051 27.7500 C111 C
453 454 1 1 Goldenberg, Mr. Samuel L male 49.0 1 0 17453 89.1042 C92 C
473 474 1 2 Jerwan, Mrs. Amin S (Marie Marthe Thuillard) female 23.0 0 0 SC/AH Basle 541 13.7917 D C
484 485 1 1 Bishop, Mr. Dickinson H male 25.0 1 0 11967 91.0792 B49 C
487 488 0 1 Kent, Mr. Edward Austin male 58.0 0 0 11771 29.7000 B37 C
496 497 1 1 Eustis, Miss. Elizabeth Mussey female 54.0 1 0 36947 78.2667 D20 C
505 506 0 1 Penasco y Castellana, Mr. Victor de Satode male 18.0 1 0 PC 17758 108.9000 C65 C
523 524 1 1 Hippach, Mrs. Louis Albert (Ida Sophia Fischer) female 44.0 0 1 111361 57.9792 B18 C
539 540 1 1 Frolicher, Miss. Hedwig Margaritha female 22.0 0 2 13568 49.5000 B39 C
544 545 0 1 Douglas, Mr. Walter Donald male 50.0 1 0 PC 17761 106.4250 C86 C
550 551 1 1 Thayer, Mr. John Borland Jr male 17.0 0 2 17421 110.8833 C70 C
556 557 1 1 Duff Gordon, Lady. (Lucille Christiana Sutherl... female 48.0 1 0 11755 39.6000 A16 C
581 582 1 1 Thayer, Mrs. John Borland (Marian Longstreth M... female 39.0 1 1 17421 110.8833 C68 C
583 584 0 1 Ross, Mr. John Hugo male 36.0 0 0 13049 40.1250 A10 C
587 588 1 1 Frolicher-Stehli, Mr. Maxmillian male 60.0 1 1 13567 79.2000 B41 C
591 592 1 1 Stephenson, Mrs. Walter Bertram (Martha Eustis) female 52.0 1 0 36947 78.2667 D20 C
599 600 1 1 Duff Gordon, Sir. Cosmo Edmund ("Mr Morgan") male 49.0 1 0 PC 17485 56.9292 A20 C
632 633 1 1 Stahelin-Maeglin, Dr. Max male 32.0 0 0 13214 30.5000 B50 C
641 642 1 1 Sagesser, Mlle. Emma female 24.0 0 0 PC 17477 69.3000 B35 C
645 646 1 1 Harper, Mr. Henry Sleeper male 48.0 1 0 PC 17572 76.7292 D33 C
647 648 1 1 Simonius-Blumer, Col. Oberst Alfons male 56.0 0 0 13213 35.5000 A26 C
659 660 0 1 Newell, Mr. Arthur Webster male 58.0 0 2 35273 113.2750 D48 C
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
681 682 1 1 Hassab, Mr. Hammad male 27.0 0 0 PC 17572 76.7292 D49 C
698 699 0 1 Thayer, Mr. John Borland male 49.0 1 1 17421 110.8833 C68 C
700 701 1 1 Astor, Mrs. John Jacob (Madeleine Talmadge Force) female 18.0 1 0 PC 17757 227.5250 C62 C64 C
710 711 1 1 Mayne, Mlle. Berthe Antonine ("Mrs de Villiers") female 24.0 0 0 PC 17482 49.5042 C90 C
716 717 1 1 Endres, Miss. Caroline Louise female 38.0 0 0 PC 17757 227.5250 C45 C
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C
742 743 1 1 Ryerson, Miss. Susan Parker "Suzette" female 21.0 2 2 PC 17608 262.3750 B57 B59 B63 B66 C
789 790 0 1 Guggenheim, Mr. Benjamin male 46.0 0 0 PC 17593 79.2000 B82 B84 C
835 836 1 1 Compton, Miss. Sara Rebecca female 39.0 1 1 PC 17756 83.1583 E49 C
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C

Comments: not all passengers from Cherbourg are Gallic (🇫🇷: gaulois) …

What is the survival rate for raw data?

df_titanic_raw['Survived'].mean()
0.3838383838383838

What is the survival rate for data after removing missing values?

df_titanic['Survived'].mean()
0.6721311475409836

See also the command:

df_titanic.groupby(['Sex'])[['Survived', 'Age', 'Fare']].mean()
Survived Age Fare
Sex
female 0.931818 32.676136 89.000900
male 0.431579 38.451789 69.124343

Conclusion: Be careful when you remove missing values: the missingness itself might be informative!

EXERCISE: More data analysis

What was the proportion of women on the boat?

Data import/export

The Pandas library supports many formats:

  • CSV, text
  • SQL database
  • Excel
  • HDF5
  • JSON
  • HTML
  • pickle
  • sas, stata
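As a quick sketch of the read/write API (toy dataframe, in-memory buffer so nothing touches the disk), each format comes as a to_<format>/read_<format> pair:

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# write to CSV, then read it back (same API as with an on-disk file path)
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df_back = pd.read_csv(buf)
print(df.equals(df_back))  # True
```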

Exploration

pd.options.display.max_rows = 8
df_titanic_raw.tail()
Name PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 Montvila, Rev. Juozas 887 0 2 male 27.0 0 0 211536 13.00 NaN S
887 Graham, Miss. Margaret Edith 888 1 1 female 19.0 0 0 112053 30.00 B42 S
888 Johnston, Miss. Catherine Helen "Carrie" 889 0 3 female NaN 1 2 W./C. 6607 23.45 NaN S
889 Behr, Mr. Karl Howell 890 1 1 male 26.0 0 0 111369 30.00 C148 C
890 Dooley, Mr. Patrick 891 0 3 male 32.0 0 0 370376 7.75 NaN Q
df_titanic_raw.head()
Name PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 Braund, Mr. Owen Harris 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
1 Cumings, Mrs. John Bradley (Florence Briggs Th... 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85 C
2 Heikkinen, Miss. Laina 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
4 Allen, Mr. William Henry 5 0 3 male 35.0 0 0 373450 8.0500 NaN S

Access values by rows/columns, etc.

  • iloc
df_titanic_raw.iloc[0:2, 1:8]
PassengerId Survived Pclass Sex Age SibSp Parch
0 1 0 3 male 22.0 1 0
1 2 1 1 female 38.0 1 0
  • loc
# with original index:
# df_titanic_raw.loc[128]

# with naming indexing
df_titanic_raw = df_titanic_raw.set_index('Name')  # You can only do it once !!
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Fare']
26.55
df_titanic_raw.loc['Bonnell, Miss. Elizabeth']
PassengerId        12
Survived            1
Pclass              1
Sex            female
                ...  
Ticket         113783
Fare            26.55
Cabin            C103
Embarked            S
Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived']
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived'] = 0
df_titanic_raw.loc['Bonnell, Miss. Elizabeth']
PassengerId        12
Survived            0
Pclass              1
Sex            female
                ...  
Ticket         113783
Fare            26.55
Cabin            C103
Embarked            S
Name: Bonnell, Miss. Elizabeth, Length: 11, dtype: object
# set back the original value
df_titanic_raw.loc['Bonnell, Miss. Elizabeth', 'Survived'] = 1
df_titanic_raw = df_titanic_raw.reset_index()  # Back to original indexing

groupby

df_titanic.groupby(['Sex'])[['Survived', 'Age', 'Fare', 'Pclass']].mean()
Survived Age Fare Pclass
Sex
female 0.931818 32.676136 89.000900 1.215909
male 0.431579 38.451789 69.124343 1.168421

Create binned values

bins = np.arange(0, 100, 10)
current_palette = sns.color_palette()

df_test = pd.DataFrame({'Age': pd.cut(df_titanic['Age'], bins=bins, right=False)})
ax = sns.countplot(data=df_test, x='Age', color=current_palette[0])
ax.tick_params(axis='x', labelrotation=90)
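Under the hood, pd.cut maps each value to a pandas Interval; a minimal sketch with toy ages (right=False closes intervals on the left, as above):

```python
import numpy as np
import pandas as pd

ages = pd.Series([2, 15, 34, 67])  # toy ages, not the titanic column
# right=False makes intervals closed on the left: [0, 10), [10, 20), ...
binned = pd.cut(ages, bins=np.arange(0, 100, 10), right=False)
print(str(binned.iloc[0]))                        # [0, 10)
print(binned.iloc[1].left, binned.iloc[1].right)  # interval bounds
```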

Second case study: air quality in Paris (Source: Airparif)

url = "http://josephsalmon.eu/enseignement/datasets/20080421_20160927-PA13_auto.csv"
path_target = "./20080421_20160927-PA13_auto.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/20080421_20160927-PA13_auto.csv'

For instance, you can inspect the first lines of the file (here via IPython's ! shell escape):

!head -26 ./20080421_20160927-PA13_auto.csv

Alternatively:

from IPython import get_ipython
get_ipython().system('head -26 ./20080421_20160927-PA13_auto.csv')

polution_df = pd.read_csv('20080421_20160927-PA13_auto.csv', sep=';',
                          comment='#',
                          na_values="n/d",
                          converters={'heure': str})
pd.options.display.max_rows = 30
polution_df.head(25)
date heure NO2 O3
0 21/04/2008 1 13.0 74.0
1 21/04/2008 2 11.0 73.0
2 21/04/2008 3 13.0 64.0
3 21/04/2008 4 23.0 46.0
4 21/04/2008 5 47.0 24.0
5 21/04/2008 6 70.0 11.0
6 21/04/2008 7 70.0 17.0
7 21/04/2008 8 76.0 16.0
8 21/04/2008 9 NaN NaN
9 21/04/2008 10 NaN NaN
10 21/04/2008 11 NaN NaN
11 21/04/2008 12 33.0 60.0
12 21/04/2008 13 31.0 61.0
13 21/04/2008 14 37.0 61.0
14 21/04/2008 15 20.0 78.0
15 21/04/2008 16 29.0 71.0
16 21/04/2008 17 30.0 70.0
17 21/04/2008 18 38.0 58.0
18 21/04/2008 19 52.0 40.0
19 21/04/2008 20 56.0 29.0
20 21/04/2008 21 39.0 40.0
21 21/04/2008 22 31.0 42.0
22 21/04/2008 23 29.0 42.0
23 21/04/2008 24 28.0 36.0
24 22/04/2008 1 46.0 16.0

Data preprocessing

# check types
polution_df.dtypes

# check all
polution_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73920 entries, 0 to 73919
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    73920 non-null  object 
 1   heure   73920 non-null  object 
 2   NO2     71008 non-null  float64
 3   O3      71452 non-null  float64
dtypes: float64(2), object(2)
memory usage: 2.3+ MB

For more info on the nature of Pandas objects, see this discussion on Stackoverflow. Moreover, the backend is slowly moving from numpy to pyarrow; cf. the Pandas user guide.

EXERCISE: handling missing values

What is the meaning of na_values="n/d" above? Note that an alternative can be obtained with the command polution_df.replace("n/d", np.nan, inplace=True).

Issues with the non-conventional hour/day format

Start by changing to integer type (e.g., int8):

polution_df['heure'] = polution_df['heure'].astype(np.int8)
polution_df['heure']
0         1
1         2
2         3
3         4
4         5
         ..
73915    20
73916    21
73917    22
73918    23
73919    24
Name: heure, Length: 73920, dtype: int8

Now the hours range from 1 to 24… not conventional, so let's make them range from 0 to 23:

polution_df['heure'] = polution_df['heure'] - 1
polution_df['heure']
0         0
1         1
2         2
3         3
4         4
         ..
73915    19
73916    20
73917    21
73918    22
73919    23
Name: heure, Length: 73920, dtype: int8

and back to strings:

polution_df['heure'] = polution_df['heure'].astype('str')
polution_df['heure']
0         0
1         1
2         2
3         3
4         4
         ..
73915    19
73916    20
73917    21
73918    22
73919    23
Name: heure, Length: 73920, dtype: object

Time processing

Note that we have used the following format codes:

  • %d = day
  • %m = month
  • %Y = year
  • %H = hour
  • %M = minutes

time_improved = pd.to_datetime(polution_df['date'] +
                               ' ' + polution_df['heure'] + ':00',
                               format='%d/%m/%Y %H:%M')

time_improved
0       2008-04-21 00:00:00
1       2008-04-21 01:00:00
2       2008-04-21 02:00:00
3       2008-04-21 03:00:00
4       2008-04-21 04:00:00
                ...        
73915   2016-09-27 19:00:00
73916   2016-09-27 20:00:00
73917   2016-09-27 21:00:00
73918   2016-09-27 22:00:00
73919   2016-09-27 23:00:00
Length: 73920, dtype: datetime64[ns]
polution_df['date'] + ' ' + polution_df['heure'] + ':00'
0         21/04/2008 0:00
1         21/04/2008 1:00
2         21/04/2008 2:00
3         21/04/2008 3:00
4         21/04/2008 4:00
               ...       
73915    27/09/2016 19:00
73916    27/09/2016 20:00
73917    27/09/2016 21:00
73918    27/09/2016 22:00
73919    27/09/2016 23:00
Length: 73920, dtype: object
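The same directives work in both directions: pd.to_datetime parses strings with an explicit format, and .dt.strftime formats parsed timestamps back into strings. A minimal round trip on toy data:

```python
import pandas as pd

# Parse strings with explicit directives (%d/%m/%Y %H:%M), then format back
s = pd.Series(["21/04/2008 05:00", "27/09/2016 23:00"])
t = pd.to_datetime(s, format="%d/%m/%Y %H:%M")
print(t.dt.strftime("%Y-%m-%d %H:%M"))
```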

Create correct timing format in the dataframe

polution_df['DateTime'] = time_improved
# remove useless columns:
del polution_df['heure']
del polution_df['date']
polution_df
        NO2    O3            DateTime
0      13.0  74.0 2008-04-21 00:00:00
1      11.0  73.0 2008-04-21 01:00:00
2      13.0  64.0 2008-04-21 02:00:00
3      23.0  46.0 2008-04-21 03:00:00
4      47.0  24.0 2008-04-21 04:00:00
...     ...   ...                 ...
73915  55.0  31.0 2016-09-27 19:00:00
73916  85.0   5.0 2016-09-27 20:00:00
73917  75.0   9.0 2016-09-27 21:00:00
73918  64.0  15.0 2016-09-27 22:00:00
73919  57.0  14.0 2016-09-27 23:00:00

73920 rows × 3 columns

Visualize the data set now that the time is well formatted:

polution_ts = polution_df.set_index(['DateTime'])
polution_ts = polution_ts.sort_index(ascending=True)
polution_ts.head(12)
                      NO2    O3
DateTime
2008-04-21 00:00:00  13.0  74.0
2008-04-21 01:00:00  11.0  73.0
2008-04-21 02:00:00  13.0  64.0
2008-04-21 03:00:00  23.0  46.0
2008-04-21 04:00:00  47.0  24.0
2008-04-21 05:00:00  70.0  11.0
2008-04-21 06:00:00  70.0  17.0
2008-04-21 07:00:00  76.0  16.0
2008-04-21 08:00:00   NaN   NaN
2008-04-21 09:00:00   NaN   NaN
2008-04-21 10:00:00   NaN   NaN
2008-04-21 11:00:00  33.0  60.0
polution_ts.describe()
                NO2            O3
count  71008.000000  71452.000000
mean      34.453414     39.610046
std       20.380702     28.837333
min        1.000000      0.000000
25%       19.000000     16.000000
50%       30.000000     38.000000
75%       46.000000     58.000000
max      167.000000    211.000000
fig, axes = plt.subplots(2, 1, figsize=(6, 6), sharex=True)

axes[0].plot(polution_ts['O3'])
axes[0].set_title("Ozone pollution: hourly concentration in Paris")
axes[0].set_ylabel("Concentration (µg/m³)")

axes[1].plot(polution_ts['NO2'])
axes[1].set_title("Nitrogen dioxide pollution: hourly concentration in Paris")
axes[1].set_ylabel("Concentration (µg/m³)")
plt.show()

fig, axes = plt.subplots(2, 1, figsize=(10, 5), sharex=True)

axes[0].plot(polution_ts['O3'].resample('d').max(), '--')
axes[0].plot(polution_ts['O3'].resample('d').min(), '-.')

axes[0].set_title("Ozone pollution: daily max/min in Paris")
axes[0].set_ylabel("Concentration (µg/m³)")

axes[1].plot(polution_ts['NO2'].resample('d').max(), '--')
axes[1].plot(polution_ts['NO2'].resample('d').min(), '-.')

axes[1].set_title("Nitrogen dioxide pollution: daily max/min in Paris")
axes[1].set_ylabel("Concentration (µg/m³)")

plt.show()

Source: https://www.tutorialspoint.com/python/time_strptime.htm

EXERCISE: extreme values per day

Provide the same plots as before, but with the daily maximum and minimum on the same figure (use different colors and/or line styles).

Q: Is the pollution getting better over the years or not?

fig, ax = plt.subplots(1, 1)
polution_ts['2008':].resample('Y').mean().plot(ax=ax)
# Resample by year: 'Y' for year ('A' for annual also works)
plt.ylim(0, 50)
plt.title("Pollution evolution: \n yearly average in Paris")
plt.ylabel("Concentration (µg/m³)")
plt.xlabel("Year")
plt.show()
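The resample calls used above group a datetime-indexed series into regular bins ('d' for days, 'Y' for years) before aggregating; a self-contained sketch on a synthetic hourly series:

```python
import numpy as np
import pandas as pd

# Synthetic hourly series over two days: values 0..47
idx = pd.date_range("2020-01-01", periods=48, freq="h")
ts = pd.Series(np.arange(48, dtype=float), index=idx)

daily_max = ts.resample('d').max()  # one value per calendar day
daily_min = ts.resample('d').min()
print(daily_max)  # 23.0 on the first day, 47.0 on the second
```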

Loading colors:

sns.set_palette("GnBu_d", n_colors=7)
polution_ts['weekday'] = polution_ts.index.weekday  # Monday=0, Sunday=6
polution_ts['weekend'] = polution_ts['weekday'].isin([5, 6])

polution_week_no2 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
    'NO2'].mean().unstack(level=0)
polution_week_03 = polution_ts.groupby(['weekday', polution_ts.index.hour])[
    'O3'].mean().unstack(level=0)
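The groupby(...).mean().unstack(level=0) pattern above computes one mean per (weekday, hour) pair, then pivots the first grouping level into columns; a toy illustration (the column names are illustrative):

```python
import pandas as pd

# One NO2-like measurement per (weekday, hour) pair
df = pd.DataFrame({
    "weekday": [0, 0, 1, 1],
    "hour":    [8, 9, 8, 9],
    "NO2":     [30.0, 40.0, 20.0, 10.0],
})
table = df.groupby(["weekday", "hour"])["NO2"].mean().unstack(level=0)
print(table)  # rows are hours, columns are weekdays
```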
fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)

polution_week_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Hour of the day")
axes[0].set_title(
    "Daily NO2 concentration: weekend effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 60)

polution_week_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Hour of the day")
axes[1].set_title("Daily O3 concentration: weekend effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 70)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=[day for day in calendar.day_name], loc='lower left', bbox_to_anchor=(1, 0.1))

plt.tight_layout()
plt.show()

polution_ts['month'] = polution_ts.index.month  # January=1, ..., December=12
polution_ts['month'] = polution_ts['month'].apply(lambda x:
                                                  calendar.month_abbr[x])
polution_ts.head()
                      NO2    O3  weekday  weekend month
DateTime
2008-04-21 00:00:00  13.0  74.0        0    False   Apr
2008-04-21 01:00:00  11.0  73.0        0    False   Apr
2008-04-21 02:00:00  13.0  64.0        0    False   Apr
2008-04-21 03:00:00  23.0  46.0        0    False   Apr
2008-04-21 04:00:00  47.0  24.0        0    False   Apr
polution_month_no2 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'NO2'].mean().unstack(level=0)
polution_month_03 = polution_ts.groupby(['month', polution_ts.index.hour])[
    'O3'].mean().unstack(level=0)
sns.set_palette("Paired", n_colors=12)

fig, axes = plt.subplots(2, 1, figsize=(7, 7), sharex=True)

polution_month_no2.plot(ax=axes[0])
axes[0].set_ylabel("Concentration (µg/m³)")
axes[0].set_xlabel("Hour of the day")
axes[0].set_title(
    "Daily NO2 profile per month: seasonal effect?")
axes[0].set_xticks(np.arange(0, 24))
axes[0].set_xticklabels(np.arange(0, 24), rotation=45)
axes[0].set_ylim(0, 90)

polution_month_03.plot(ax=axes[1])
axes[1].set_ylabel("Concentration (µg/m³)")
axes[1].set_xlabel("Hour of the day")
axes[1].set_title("Daily O3 profile per month: seasonal effect?")
axes[1].set_xticks(np.arange(0, 24))
axes[1].set_xticklabels(np.arange(0, 24), rotation=45)
axes[1].set_ylim(0, 90)
axes[0].legend().set_visible(False)
# ax.legend()
axes[1].legend(labels=calendar.month_name[1:], loc='lower left',
               bbox_to_anchor=(1, 0.1))
plt.tight_layout()
plt.show()

Third case study: explore a dataset on bike accidents in France

url = "https://koumoul.com/s/data-fair/api/v1/datasets/accidents-velos/raw"
path_target = "./bicycle_db.csv"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/bicycle_db.csv'
# df: data frame
df_bikes = pd.read_csv("bicycle_db.csv", na_values="", low_memory=False,
                       dtype={'data': str, 'heure': str, 'departement': str})

In June 2023, the dataset maintainers renamed the columns, so we define a dictionary to map back to the legacy names:

new2old = {
    "hrmn": "heure",
    "secuexist": "existence securite",
    "grav": "gravite accident",
    "dep": "departement",
}

df_bikes.rename(columns=new2old, inplace=True)
get_ipython().system('head -5 ./bicycle_db.csv')
pd.options.display.max_columns = 40
df_bikes.head()
Num_Acc date an mois jour heure departement com lat long agg int col lum atm catr circ nbv prof plan lartpc larrout surf infra situ gravite accident sexe age trajet existence securite equipement obs obsm choc manv vehiculeid typevehicules manoeuvehicules numVehicules
0 200500000030 2005-01-13 2005 janvier jeudi 19:45 62 62331 50.3 2.84 2 1 3.0 5 1.0 3 2.0 0.0 -1.0 1.0 000 050 1.0 0.0 1.0 4 1 58.0 5.0 0 0 0.0 2.0 8.0 11.0 200500000030B02 18 17 1.0
1 200500000034 2005-01-19 2005 janvier mercredi 10:45 62 62022 0 0 1 1 1.0 1 7.0 3 2.0 0.0 1.0 3.0 000 050 1.0 0.0 1.0 3 1 20.0 5.0 0 0 0.0 2.0 1.0 1.0 200500000034B02 10 15 1.0
2 200500000078 2005-01-26 2005 janvier mercredi 13:15 02 02173 0 0 1 9 3.0 1 1.0 3 2.0 2.0 2.0 1.0 000 000 1.0 0.0 1.0 4 1 71.0 5.0 1 2 0.0 2.0 1.0 1.0 200500000078B02 7 15 1.0
3 200500000093 2005-01-03 2005 janvier lundi 13:30 02 02810 49.255 3.094 2 1 1.0 1 1.0 3 2.0 0.0 1.0 2.0 000 052 1.0 0.0 1.0 3 2 51.0 4.0 0 0 0.0 2.0 3.0 21.0 200500000093B02 7 21 1.0
4 200500000170 2005-01-29 2005 janvier samedi 18:30 76 76196 0 0 1 1 2.0 3 1.0 3 2.0 2.0 1.0 1.0 000 050 1.0 0.0 1.0 4 1 74.0 5.0 1 9 0.0 2.0 4.0 2.0 200500000170A01 10 2 1.0
df_bikes['existence securite'].unique()
array([0, 1, 2])
df_bikes['gravite accident'].unique()
array([4, 3, 2, 1])

Handle missing values in heure

df_bikes['date'].hasnans
df_bikes['heure'].hasnans
False
pd.options.display.max_rows = 20
df_bikes.iloc[400:402]
Num_Acc date an mois jour heure departement com lat long agg int col lum atm catr circ nbv prof plan lartpc larrout surf infra situ gravite accident sexe age trajet existence securite equipement obs obsm choc manv vehiculeid typevehicules manoeuvehicules numVehicules
400 200500008935 2005-02-13 2005 février dimanche 37: 75 75018 0 0 2 2 3.0 5 6.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 1 33.0 1.0 1 2 0.0 2.0 3.0 7.0 200500008935B01 7 16 1.0
401 200500008941 2005-02-14 2005 février lundi 15:05 75 75007 0 0 2 1 3.0 1 1.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 2 22.0 5.0 1 2 0.0 2.0 2.0 1.0 200500008941A01 7 9 1.0

Replace missing hour entries (empty strings) with np.nan:

df_bikes['heure'] = df_bikes['heure'].replace('', np.nan)
df_bikes.iloc[400:402]
Num_Acc date an mois jour heure departement com lat long agg int col lum atm catr circ nbv prof plan lartpc larrout surf infra situ gravite accident sexe age trajet existence securite equipement obs obsm choc manv vehiculeid typevehicules manoeuvehicules numVehicules
400 200500008935 2005-02-13 2005 février dimanche 37: 75 75018 0 0 2 2 3.0 5 6.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 1 33.0 1.0 1 2 0.0 2.0 3.0 7.0 200500008935B01 7 16 1.0
401 200500008941 2005-02-14 2005 février lundi 15:05 75 75007 0 0 2 1 3.0 1 1.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 2 22.0 5.0 1 2 0.0 2.0 2.0 1.0 200500008941A01 7 9 1.0
df_bikes.dropna(subset=['heure'], inplace=True)
df_bikes.iloc[399:402]
Num_Acc date an mois jour heure departement com lat long agg int col lum atm catr circ nbv prof plan lartpc larrout surf infra situ gravite accident sexe age trajet existence securite equipement obs obsm choc manv vehiculeid typevehicules manoeuvehicules numVehicules
399 200500008875 2005-02-10 2005 février jeudi 15:00 75 75016 0 0 2 1 2.0 1 1.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 1 54.0 5.0 0 0 0.0 2.0 4.0 1.0 200500008875B01 1 1 1.0
400 200500008935 2005-02-13 2005 février dimanche 37: 75 75018 0 0 2 2 3.0 5 6.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 1 33.0 1.0 1 2 0.0 2.0 3.0 7.0 200500008935B01 7 16 1.0
401 200500008941 2005-02-14 2005 février lundi 15:05 75 75007 0 0 2 1 3.0 1 1.0 4 2.0 4.0 1.0 1.0 000 120 1.0 0.0 1.0 4 2 22.0 5.0 1 2 0.0 2.0 2.0 1.0 200500008941A01 7 9 1.0

EXERCISE: start/end of the study

Can you find the starting day and the ending day of the study automatically?

Hint: sort the data! You can sort by time, for instance with df.sort_values('Time').
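On a toy frame (the column name Time mirrors the one built for the exercise), sorting by a datetime column and reading the first and last rows gives the study boundaries:

```python
import pandas as pd

# Toy frame with an unsorted datetime column
df = pd.DataFrame({
    "Time": pd.to_datetime(["2021-01-02", "2005-01-13", "2016-09-27"]),
    "value": [1, 2, 3],
})
df_sorted = df.sort_values("Time")
start, end = df_sorted["Time"].iloc[0], df_sorted["Time"].iloc[-1]
print(start, end)
# equivalently, without sorting: df["Time"].min(), df["Time"].max()
```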

df_bikes['date'] + ' ' + df_bikes['heure']
0        2005-01-13 19:45
1        2005-01-19 10:45
2        2005-01-26 13:15
3        2005-01-03 13:30
4        2005-01-29 18:30
               ...       
74753    2021-01-02 18:30
74754    2021-01-04 08:20
74755    2021-01-01 16:55
74756    2021-01-02 15:40
74757    2021-01-01 12:20
Length: 74758, dtype: object
# Adapt the earlier approach to create df_bikes['Time']

time_improved = pd.to_datetime(df_bikes['date'] +
                               ' ' + df_bikes['heure'],
                               format='%Y-%m-%d %H:%M',
                               errors='coerce')

# Where d = day, m=month, Y=year, H=hour, M=minutes
df_bikes['Time'] = time_improved
# remove rows with NaT
df_bikes.dropna(subset=["Time"], inplace=True)
# set new index 
df_bikes.set_index('Time', inplace=True)
# remove useless columns
df_bikes.drop(columns=['heure', 'date'], inplace=True)
df_bikes.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 61124 entries, 2005-01-13 19:45:00 to 2021-01-01 12:20:00
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Num_Acc             61124 non-null  int64  
 1   an                  61124 non-null  int64  
 2   mois                61124 non-null  object 
 3   jour                61124 non-null  object 
 4   departement         61124 non-null  object 
 5   com                 61124 non-null  object 
 6   lat                 61124 non-null  object 
 7   long                60904 non-null  object 
 8   agg                 61124 non-null  int64  
 9   int                 61124 non-null  int64  
 10  col                 61121 non-null  float64
 11  lum                 61124 non-null  int64  
 12  atm                 61121 non-null  float64
 13  catr                61124 non-null  int64  
 14  circ                61002 non-null  float64
 15  nbv                 60959 non-null  float64
 16  prof                60977 non-null  float64
 17  plan                60962 non-null  float64
 18  lartpc              50586 non-null  object 
 19  larrout             56446 non-null  object 
 20  surf                60980 non-null  float64
 21  infra               60689 non-null  float64
 22  situ                60717 non-null  float64
 23  gravite accident    61124 non-null  int64  
 24  sexe                61124 non-null  int64  
 25  age                 61105 non-null  float64
 26  trajet              61121 non-null  float64
 27  existence securite  61124 non-null  int64  
 28  equipement          57383 non-null  object 
 29  obs                 61102 non-null  float64
 30  obsm                61089 non-null  float64
 31  choc                61113 non-null  float64
 32  manv                61110 non-null  float64
 33  vehiculeid          61124 non-null  object 
 34  typevehicules       52392 non-null  object 
 35  manoeuvehicules     52377 non-null  object 
 36  numVehicules        52392 non-null  float64
dtypes: float64(16), int64(9), object(12)
memory usage: 17.7+ MB
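The errors='coerce' option used above turns entries that do not match the format (such as the truncated "37:" hour seen earlier) into NaT instead of raising, which is why the row count shrinks after dropna; a minimal illustration:

```python
import pandas as pd

raw = pd.Series(["2005-02-14 15:05", "2005-02-13 37:"])  # second entry is malformed
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")
print(parsed)  # the malformed entry becomes NaT
```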
df_bike2 = df_bikes.loc[
    :, ["gravite accident", "existence securite", "age", "sexe"]
]
df_bike2["existence securite"] = df_bike2["existence securite"].replace({"Inconnu": np.nan})
df_bike2.dropna(inplace=True)
EXERCISE: Is the helmet saving your life?

Perform an analysis so that you can check the benefit or not of wearing a helmet to save your life. Beware: Preprocessing is needed to use pd.crosstab, pivot_table to avoid issues.
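pd.crosstab mentioned above tabulates joint counts of two categorical columns, and normalize turns counts into proportions; a sketch on toy data (the column names and labels are illustrative, not those of the accident file):

```python
import pandas as pd

df = pd.DataFrame({
    "helmet":   ["yes", "yes", "no", "no", "no"],
    "severity": ["light", "severe", "light", "severe", "severe"],
})
counts = pd.crosstab(df["helmet"], df["severity"])
print(counts)

# normalize="index" gives row-wise proportions instead of raw counts
props = pd.crosstab(df["helmet"], df["severity"], normalize="index")
print(props)
```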

age
existence securite          0        1       2    All
gravite accident sexe
1                -1       NaN      1.0     NaN      1
                  1     203.0   3197.0   229.0   3629
                  2      48.0    497.0    37.0    582
2                 1     190.0   1555.0   177.0   1922
                  2      60.0    254.0    28.0    342
3                 1    1383.0  13888.0   902.0  16173
                  2     429.0   3331.0   218.0   3978
4                 1    1063.0  22313.0  1561.0  24937
                  2     515.0   8437.0   589.0   9541
All                    3891.0  53473.0  3741.0  61105
gravite accident           1         2          3          4
existence securite
0                   6.450784  6.425084  46.569005  40.555127
1                   6.910029  3.383016  32.201298  57.505657
2                   7.110398  5.479818  29.938519  57.471264
EXERCISE: Are men and women dying equally on a bike?

Perform an analysis to check differences between men’s and women’s survival.

sexe
-1    0.000016
 1    0.763563
 2    0.236421
dtype: float64
gravite accident          1          2          3          4        All
sexe
-1                 0.023742   0.000000   0.000000   0.000000   0.001637
 1                86.158594  84.893993  80.259044  72.327281  76.362000
 2                13.817664  15.106007  19.740956  27.672719  23.636364

To conclude

Note that in the dataset, the information on the level of bike practice by gender is missing.

EXERCISE: Accident during the week?

Perform an analysis to check when the accidents are occurring during the week.

df_bikes
Num_Acc an mois jour departement com lat long agg int col lum atm catr circ nbv prof plan lartpc larrout surf infra situ gravite accident sexe age trajet existence securite equipement obs obsm choc manv vehiculeid typevehicules manoeuvehicules numVehicules
Time
2005-01-13 19:45:00 200500000030 2005 janvier jeudi 62 62331 50.3 2.84 2 1 3.0 5 1.0 3 2.0 0.0 -1.0 1.0 000 050 1.0 0.0 1.0 4 1 58.0 5.0 0 0 0.0 2.0 8.0 11.0 200500000030B02 18 17 1.0
2005-01-19 10:45:00 200500000034 2005 janvier mercredi 62 62022 0 0 1 1 1.0 1 7.0 3 2.0 0.0 1.0 3.0 000 050 1.0 0.0 1.0 3 1 20.0 5.0 0 0 0.0 2.0 1.0 1.0 200500000034B02 10 15 1.0
2005-01-26 13:15:00 200500000078 2005 janvier mercredi 02 02173 0 0 1 9 3.0 1 1.0 3 2.0 2.0 2.0 1.0 000 000 1.0 0.0 1.0 4 1 71.0 5.0 1 2 0.0 2.0 1.0 1.0 200500000078B02 7 15 1.0
2005-01-03 13:30:00 200500000093 2005 janvier lundi 02 02810 49.255 3.094 2 1 1.0 1 1.0 3 2.0 0.0 1.0 2.0 000 052 1.0 0.0 1.0 3 2 51.0 4.0 0 0 0.0 2.0 3.0 21.0 200500000093B02 7 21 1.0
2005-01-29 18:30:00 200500000170 2005 janvier samedi 76 76196 0 0 1 1 2.0 3 1.0 3 2.0 2.0 1.0 1.0 000 050 1.0 0.0 1.0 4 1 74.0 5.0 1 9 0.0 2.0 4.0 2.0 200500000170A01 10 2 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-01-02 18:30:00 202100056317 2021 janvier samedi 44 44168 47,3777890000 -2,1976410000 2 1 3.0 5 1.0 3 2.0 2.0 1.0 2.0 NaN -1 1.0 0.0 1.0 4 1 35.0 5.0 2 NaN 0.0 0.0 8.0 1.0 202100056317B01 7 14 1.0
2021-01-04 08:20:00 202100056362 2021 janvier lundi 64 64138 43,2309460000 -0,2765840000 2 3 1.0 1 1.0 3 2.0 2.0 1.0 1.0 NaN -1 1.0 0.0 1.0 4 1 53.0 1.0 1 2 0.0 2.0 0.0 1.0 202100056362B01 7 15 1.0
2021-01-01 16:55:00 202100056404 2021 janvier vendredi 54 54395 48,6849869839 6,1760189384 2 2 3.0 5 1.0 4 1.0 2.0 1.0 1.0 NaN -1 1.0 -1.0 1.0 4 1 12.0 5.0 1 2 0.0 2.0 1.0 25.0 202100056404A01 7 1 1.0
2021-01-02 15:40:00 202100056424 2021 janvier samedi 75 75110 48,8769050000 2,3665940000 2 1 3.0 1 8.0 4 1.0 1.0 2.0 1.0 NaN -1 1.0 -1.0 5.0 4 2 43.0 5.0 2 NaN 0.0 2.0 1.0 1.0 202100056424A01 7 9 1.0
2021-01-01 12:20:00 202100056508 2021 janvier vendredi 64 64400 43,1465730000 -0,1955440000 1 1 7.0 1 1.0 4 2.0 2.0 2.0 2.0 NaN -1 2.0 0.0 1.0 3 1 46.0 5.0 1 2/6 0.0 0.0 8.0 1.0 202100056508A01 NaN NaN NaN

61124 rows × 37 columns

# Load colors
sns.set_palette("GnBu_d", n_colors=7)

df_bikes['weekday'] = df_bikes.index.day_of_week  # Monday=0, Sunday=6

accidents_week = df_bikes.groupby(['weekday', df_bikes.index.hour])[
    'sexe'].count().unstack(level=0)

fig, axes = plt.subplots(1, 1, figsize=(7, 7))
accidents_week.plot(ax=axes)
axes.set_ylabel("Number of accidents")
axes.set_xlabel("Hour of the day")
axes.set_title(
    "Daily accident profile: weekend effect?")
axes.set_xticks(np.arange(0, 24))
axes.set_xticklabels(np.arange(0, 24), rotation=45)
axes.legend(
    labels=[day for day in calendar.day_name],
    loc='upper left',
    )
plt.tight_layout()
plt.show()

df_bikes.groupby(['weekday', df_bikes.index.hour])[
    'sexe'].count()
weekday  Time
0        0         4
         1         4
         3         3
         4         3
         5         6
                ... 
6        19      385
         20      260
         21      159
         22       92
         23       80
Name: sexe, Length: 164, dtype: int64
EXERCISE: Accident during the year

Perform an analysis to check when the accidents are occurring during the year.

df_bikes['month'] = df_bikes.index.month  # January=1, ..., December=12
df_bikes['month'] = df_bikes['month'].apply(lambda x: calendar.month_abbr[x])
df_bikes.head()

sns.set_palette("GnBu_d", n_colors=12)  # sns.set_palette("colorblind",...)

df_bikes_month = df_bikes.groupby(['month', df_bikes.index.hour])[
    'age'].count().unstack(level=0)

fig, axes = plt.subplots(1, 1, figsize=(7, 7), sharex=True)

df_bikes_month.plot(ax=axes)
axes.set_ylabel("Number of accidents")
axes.set_xlabel("Hour of the day")
axes.set_title(
    "Daily accident profile per month: seasonal effect?")
axes.set_xticks(np.arange(0, 24))
axes.set_xticklabels(np.arange(0, 24), rotation=45)
axes.legend(labels=calendar.month_name[1:], loc='upper left')

plt.tight_layout()
plt.show()

EXERCISE: Accidents by department

Perform an analysis to check when the accidents are occurring for each department, relative to population size.

path_target = "./dpt_population.csv"
url = "https://public.opendatasoft.com/explore/dataset/population-francaise-par-departement-2018/download/?format=csv&timezone=Europe/Berlin&lang=en&use_labels_for_header=true&csv_separator=%3B"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/dpt_population.csv'
df_dtp_pop = pd.read_csv("dpt_population.csv", sep=";", low_memory=False)

df_dtp_pop['Code Département'].replace('2A', '20A',inplace=True)
df_dtp_pop['Code Département'].replace('2B', '20B',inplace=True)
df_dtp_pop.sort_values(by=['Code Département'], inplace=True)

df_bikes['departement'].replace('2A', '20A',inplace=True)
df_bikes['departement'].replace('2B', '20B',inplace=True)
df_bikes.sort_values(by=['departement'], inplace=True)
# Clean extra departements
df_bikes = df_bikes.loc[df_bikes['departement'].isin(df_dtp_pop['Code Département'].unique())]

gd = df_bikes.groupby(['departement'], as_index=True, sort=True).size()

data = {'code': gd.index,
        '# Accidents per million': gd.values}
df = pd.DataFrame(data)
df['# Accidents per million'] = df['# Accidents per million'].values * 1e6 / df_dtp_pop['Population'].values
path_target = "./departements-avec-outre-mer.geojson"
url = "https://raw.githubusercontent.com/gregoiredavid/france-geojson/master/departements-avec-outre-mer.geojson"
path, fname = os.path.split(path_target)
pooch.retrieve(url, path=path, fname=fname, known_hash=None)
'/home/bcharlier/enseignement/HAX712/HAX712X/Courses/Pandas/departements-avec-outre-mer.geojson'
import plotly.express as px
import geopandas

departement = geopandas.read_file('departements-avec-outre-mer.geojson')
departement['code'].replace('2A', '20A', inplace=True)
departement['code'].replace('2B', '20B', inplace=True)

departement.sort_values(by=['code'], inplace=True)

a = ['0'+ str(i) for i in range(1, 10)]
b = [str(i) for i in range(1, 10)]
dict_replace = dict(zip(a, b))

departement['code'].replace(dict_replace, inplace=True)
df['code'].replace(dict_replace, inplace=True)

departement['code'].replace('20A', '2A', inplace=True)
departement['code'].replace('20B', '2B', inplace=True)
df['code'].replace('20A', '2A',inplace=True)
df['code'].replace('20B', '2B',inplace=True)

departement.set_index('code', inplace=True)
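An alternative to the replacement dictionary built above is to zero-pad the short codes directly with str.zfill, which leaves already well-formed codes (including Corsica's '2A'/'2B') untouched; a sketch on toy codes:

```python
import pandas as pd

codes = pd.Series(["1", "9", "10", "2A"])
print(codes.str.zfill(2))  # '1' -> '01', '9' -> '09'; '10' and '2A' are unchanged
```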

fig = px.choropleth_mapbox(
    df,
    geojson=departement,
    locations="code",
    color="# Accidents per million",
    range_color=(0, df['# Accidents per million'].max()),
    color_continuous_scale="rdbu",
    center={'lat': 47, 'lon': 2},
    zoom=3.25,
    mapbox_style="white-bg",
)
fig.update_traces(colorbar_orientation='h', selector=dict(type='choroplethmapbox'))
fig.update_layout(
    title_text = 'Accidents per million inhabitants by department',
)
fig.layout.coloraxis.colorbar.thickness = 20
fig.layout.coloraxis.colorbar.orientation = 'h'
fig.layout.coloraxis.colorbar.y = -0.2
fig.layout.coloraxis.colorbar.x = 0.2

fig.show()
EXERCISE: Accidents by department

Perform an analysis to check when the accidents are occurring for each department, relative to the area of the departments.
